[Dist][Release] Use one wheel for different CUDA version #1826
oraluben wants to merge 20 commits into tile-ai:main from
Conversation
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Removed nightly-specific CI branches and simplified CUDA/NVRTC/CUDART stub loading to probe global symbols first via dlsym(RTLD_DEFAULT), then dlsym(RTLD_NEXT), emitting an error and aborting if the target symbols are absent. A minimal sketch of this lookup order follows the sequence diagram below.
Sequence Diagram(s)
sequenceDiagram
participant App as "App / Stub"
participant Linker as "Dynamic Linker\n(RTLD_DEFAULT / RTLD_NEXT)"
participant Lib as "CUDA lib\n(if loaded)"
App->>Linker: dlsym(RTLD_DEFAULT, "symbol")
alt found in RTLD_DEFAULT
Linker-->>App: symbol pointer
App->>Lib: call via global symbol (no dlopen)
else not found
App->>Linker: dlsym(RTLD_NEXT, "symbol")
alt found in RTLD_NEXT
Linker-->>App: symbol pointer
App->>Lib: call via next-symbol resolver
else not found
Linker-->>App: NULL
App->>App: print error and abort()
end
end
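Below is a minimal C++ sketch of the lookup order described in the walkthrough and illustrated above. The helper name and the error message are hypothetical; the actual code in src/target/stubs/ is structured differently.

```cpp
// Hypothetical helper showing the stubs' new resolution order:
// RTLD_DEFAULT -> RTLD_NEXT -> error + abort.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT is a GNU extension
#endif
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

static void *LookupCudaSymbol(const char *name) {
  // 1. Symbols already visible in the global namespace
  //    (e.g. libcudart/libnvrtc pulled in by PyTorch).
  if (void *sym = dlsym(RTLD_DEFAULT, name)) return sym;
  // 2. The next object in the lookup chain after this stub.
  if (void *sym = dlsym(RTLD_NEXT, name)) return sym;
  // 3. No provider found: report and terminate deterministically.
  std::fprintf(stderr, "TileLang stub: cannot resolve %s; is a CUDA runtime loaded?\n", name);
  std::abort();
}
```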
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/dist.yml (1)
117-126: ⚠️ Potential issue | 🟡 Minor. Wheels only validated with basic importability; cross-CUDA runtime testing missing.
The dist.yml workflow builds wheels with CUDA 12.8 (lines 118–120) and only tests basic importability: import tilelang (line 203). If wheels built with CUDA 12.8 are claimed to work with CUDA 11.8 and CUDA 13, no CI workflow validates this. The ci.yml workflow also tests only CUDA 12.8, with no matrix entries for other CUDA versions.
🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 64-75: TryLoadLibCudart currently returns RTLD_DEFAULT which can
resolve to the stub's own exported cudaGetErrorString causing infinite recursion
and also fails to dlopen libcudart from disk; update TryLoadLibCudart to first
probe with dlsym(RTLD_NEXT, "cudaGetErrorString") (or check a sentinel boolean
that marks "in-stub" symbols to avoid self-resolution) and if that fails attempt
to dlopen common libcudart filenames (e.g., "libcudart.so", versioned names)
with RTLD_NOW|RTLD_LOCAL and return that handle; ensure CreateCUDARuntimeAPI
still uses the returned handle (not the stub) so function pointers point to real
libcudart symbols and not the stub to avoid recursion.
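A sketch of what the suggested change could look like, assuming TryLoadLibCudart() returns a handle that is later passed to dlsym(); the candidate SONAMEs are illustrative, not an exhaustive or verified list:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT
#endif
#include <dlfcn.h>

static void *TryLoadLibCudart() {
  // Probe RTLD_NEXT so the stub's own exported cudaGetErrorString is skipped,
  // avoiding self-resolution and the resulting infinite recursion.
  if (dlsym(RTLD_NEXT, "cudaGetErrorString") != nullptr) {
    return RTLD_NEXT;
  }
  // Nothing else in the process provides the runtime: try loading it from disk.
  const char *candidates[] = {"libcudart.so.12", "libcudart.so.11.0", "libcudart.so"};
  for (const char *name : candidates) {
    if (void *handle = dlopen(name, RTLD_NOW | RTLD_LOCAL)) {
      return handle;  // real libcudart handle, safe to dlsym() against
    }
  }
  return nullptr;  // caller reports the missing library
}
```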
In `@src/target/stubs/nvrtc.cc`:
- Around line 40-50: Replace the fragile dlsym(RTLD_DEFAULT, "nvrtcVersion")
check in TryLoadLibNvrtc with a robust load/check sequence: avoid using
RTLD_DEFAULT (which can return the stub's own nvrtcVersion) and instead first
try dlopen with RTLD_NOLOAD on likely lib names to detect an already-loaded
libnvrtc, then if not found attempt dlopen of the library file (e.g.,
"libnvrtc.so" variants) to load it from disk; use the returned dlopen handle and
dlsym(handle, "nvrtcVersion") to confirm the symbol, and return that handle (or
nullptr on failure) — update TryLoadLibNvrtc and all callers to use the dlopen
handle rather than RTLD_DEFAULT to prevent self-resolution and allow standalone
loading.
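Under the same assumptions, a sketch for the nvrtc side: detect an already-loaded libnvrtc with RTLD_NOLOAD, otherwise load it from disk, and hand back a real dlopen handle (library names are illustrative):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NOLOAD
#endif
#include <dlfcn.h>

static void *TryLoadLibNvrtc() {
  const char *candidates[] = {"libnvrtc.so.12", "libnvrtc.so.11.2", "libnvrtc.so"};
  // First check whether some libnvrtc is already mapped into the process
  // (e.g. by PyTorch); RTLD_NOLOAD never loads anything new.
  for (const char *name : candidates) {
    if (void *handle = dlopen(name, RTLD_NOLOAD | RTLD_NOW)) {
      return handle;
    }
  }
  // Not loaded yet: load from disk so the stub also works standalone.
  for (const char *name : candidates) {
    if (void *handle = dlopen(name, RTLD_NOW | RTLD_LOCAL)) {
      if (dlsym(handle, "nvrtcVersion") != nullptr) return handle;
      dlclose(handle);  // not the library we wanted; keep looking
    }
  }
  return nullptr;  // caller decides how to report failure
}
```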
🧹 Nitpick comments (1)
.github/workflows/dist.yml (1)
187-208: Wheel test only checks importability; no torch/CUDA runtime validation. The test step installs the wheel and runs import tilelang, but doesn't install or test with any PyTorch version. Given the new stub design relies on PyTorch having loaded libcudart/libnvrtc into the global namespace, this test won't exercise the dynamic loading path at all. Consider adding at least a smoke test that installs torch and verifies CUDA stub resolution succeeds.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 76-79: The current error path uses __builtin_unreachable() which
is only an optimization hint and doesn't guarantee termination; update the error
handling after the fprintf in the TileLang libcudart check to call abort() (from
stdlib.h) before the __builtin_unreachable() (or replace the builtin call with
abort()) so the process deterministically terminates and the error message is
visible; target the fprintf(...) statement and the following
__builtin_unreachable() call in cudart.cc when making this change.
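A brief sketch of the suggested error path. The MissingLibraryError name is borrowed from a later comment in this review; its real signature in cudart.cc may differ.

```cpp
#include <cstdio>
#include <cstdlib>

// abort() guarantees the process terminates, whereas __builtin_unreachable()
// is only an optimization hint: if execution reaches it, behavior is undefined
// and the error message may never be seen.
[[noreturn]] static void MissingLibraryError(const char *symbol) {
  std::fprintf(stderr, "TileLang: failed to resolve %s from libcudart\n", symbol);
  std::abort();
}
```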
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/target/stubs/cudart.cc (1)
14-16: ⚠️ Potential issue | 🟡 Minor. Stale doc comment: dlopen() is no longer used.
Lines 15–16 say "The real libcudart is loaded lazily via dlopen() on first API call, and symbols are resolved via dlsym()." After this PR, dlopen is removed entirely; symbols are found via dlsym(RTLD_DEFAULT/RTLD_NEXT, …). Update the header comment to reflect the new strategy.
src/target/stubs/nvrtc.cc (1)
13-14: ⚠️ Potential issue | 🟡 Minor. Stale doc comment: dlopen() is no longer used.
Lines 13–14 reference lazy loading via dlopen(), but this PR replaces that with dlsym(RTLD_DEFAULT/RTLD_NEXT, …). Update accordingly.
🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 66-85: TryLoadLibCudart can return RTLD_DEFAULT (which is
((void*)0) on glibc) or RTLD_NEXT, so the nullptr check in CreateCUDARuntimeAPI
is invalid and causes a spurious "missing library" path; modify the code so
CreateCUDARuntimeAPI does not treat a returned RTLD_DEFAULT as failure — either
remove the handle==nullptr guard entirely or replace it with an explicit check
that treats RTLD_DEFAULT and RTLD_NEXT as valid sentinel handles (e.g., compare
against RTLD_DEFAULT/RTLD_NEXT or use a dedicated bool flag returned by
TryLoadLibCudart); update the logic around TryLoadLibCudart,
CreateCUDARuntimeAPI, and any callers of CUDARuntimeAPI/MissingLibraryError to
rely on that explicit sentinel instead of nullptr.
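One way to make the availability signal explicit, as the comment suggests, instead of comparing a pseudo-handle against nullptr (RTLD_DEFAULT is (void *)0 on glibc). The struct and field names are illustrative only:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT
#endif
#include <dlfcn.h>

struct CudartLookup {
  void *handle;  // RTLD_DEFAULT, RTLD_NEXT, or a real dlopen handle
  bool found;    // explicit success flag; never inferred from handle != nullptr
};

static CudartLookup TryLoadLibCudart() {
  if (dlsym(RTLD_DEFAULT, "cudaGetErrorString") != nullptr) return {RTLD_DEFAULT, true};
  if (dlsym(RTLD_NEXT, "cudaGetErrorString") != nullptr) return {RTLD_NEXT, true};
  return {nullptr, false};
}
```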
In `@src/target/stubs/nvrtc.cc`:
- Around line 42-60: The check in CreateNVRTCAPI that treats a nullptr handle as
a failure must be removed because TryLoadLibNvrtc() aborts on error and may
legitimately return RTLD_DEFAULT which is (void*)0 on glibc; update
CreateNVRTCAPI to accept the handle returned by TryLoadLibNvrtc() (including
RTLD_DEFAULT) and proceed to load nvrtc symbols (references: TryLoadLibNvrtc,
CreateNVRTCAPI, NVRTCAPI, nvrtcVersion, RTLD_DEFAULT, RTLD_NEXT) instead of
returning an empty NVRTCAPI when handle == nullptr.
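A minimal sketch of the control flow this asks for: CreateNVRTCAPI uses whatever TryLoadLibNvrtc() returns, with no nullptr guard. NVRTCAPI's real layout lives in nvrtc.cc, so the types below are placeholders (nvrtcResult is modeled as int here):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>

using nvrtcVersionFn = int (*)(int *, int *);  // stands in for nvrtcResult (*)(int *, int *)

struct NVRTCAPIStub {
  nvrtcVersionFn nvrtcVersion = nullptr;
};

void *TryLoadLibNvrtc();  // assumed to abort() when no nvrtc provider exists

static NVRTCAPIStub CreateNVRTCAPIStub() {
  void *handle = TryLoadLibNvrtc();
  NVRTCAPIStub api;
  // No `handle == nullptr` check: RTLD_DEFAULT is (void *)0 on glibc, so that
  // guard would wrongly treat a valid pseudo-handle as a missing library.
  api.nvrtcVersion = reinterpret_cast<nvrtcVersionFn>(dlsym(handle, "nvrtcVersion"));
  return api;
}
```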
…. This change simplifies the API initialization process in both cudart.cc and nvrtc.cc, ensuring that the function pointer lookups proceed without early returns.
…zy loading mechanism. Update documentation to clarify the purpose and implementation of stubs for better understanding.
Follow-up to #1821.
Validate that wheels built with CUDA 12.8 work with CUDA 11.8 (torch 2.4) and CUDA 13 (torch 2.10).
After this PR, libcudart from the host toolchain is not loaded.
For libnvrtc, both the host lib and the pip lib are loaded by torch; this behavior stays the same from torch 2.4 to 2.10.
Summary by CodeRabbit
Chores
Refactor
Documentation